Reading several files: alignment (e.g., YEAST case)

In this tutorial, we show how to read several data sets and save the results in a single file.

In the YEAST case, we have 36 experiments stored in 6 files called alpha0, alpha1, alpha5, alpha10, alpha20 and alpha45. Each file holds 6 time points (t0, t1, t5, t10, t20, t45).

We first want to read all of them and build a single dataframe. This can be done using the class called MassSpecAlignmentYeast.



In [1]:

    
%pylab inline
from msdas import *
from msdas import yeast









    



Populating the interactive namespace from numpy and matplotlib
Couldn't import dot_parser, loading of dot files will not be possible.

By default, if you read a file called alpha0, the columns with measurements are renamed using the filename as a prefix. For example, a column called t0 is renamed alpha0_t0. This avoids clashes between identical names across files (t0 appears in every file). If you want to use specific prefixes instead, provide them as in the following example.
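The renaming idea can be sketched with plain pandas. This is only an illustration of the principle, not the msdas implementation: the helper name add_prefix_to_measurements, the choice of key columns, and the numeric values are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of one input file (values are made up for illustration).
alpha0 = pd.DataFrame({
    "Protein": ["DIG1", "DIG1"],
    "Psite": ["S142", "S272"],
    "Sequence": ["SAPAQVTQHSK", "VNDSYDSPLSGTASTGK"],
    "t0": [0.1, 0.2],
    "t1": [0.3, 0.4],
})


def add_prefix_to_measurements(df, prefix, keys=("Protein", "Psite", "Sequence")):
    """Return a copy where every non-key column is renamed '<prefix>_<name>'."""
    mapping = {col: "{}_{}".format(prefix, col)
               for col in df.columns if col not in keys}
    return df.rename(columns=mapping)


renamed = add_prefix_to_measurements(alpha0, "alpha0")
print(list(renamed.columns))
```

The key columns keep their names so that they can later be used to match rows across files, while each measurement column becomes unique to its file.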



In [2]:

    
filenames = yeast.get_yeast_filenames()



In [3]:

    
import pandas as pd
df1 = pd.read_csv(filenames[0])
df2 = pd.read_csv(filenames[1])
df1.columns









    Out[3]:





Index([u'Protein', u' Psite', u' Sequence', u' t0', u' t1', u' t5', u' t10',
       u' t20', u' t45'],
      dtype='object')



In [4]:

    
df2.columns









    Out[4]:





Index([u'Protein', u' Psite', u' Sequence', u' t0', u' t1', u' t5', u' t10',
       u' t20', u' t45'],
      dtype='object')
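Note that most column names above start with a space (for instance ' Psite'), which suggests that the raw files use ", " as a separator. If you read such files yourself with pandas, a minimal sketch of two ways to get clean names is shown here. The CSV text is a made-up stand-in for the real files.

```python
import io
import pandas as pd

# Hypothetical CSV text mimicking the header layout shown above
# (a space follows each comma); the values are made up.
raw = u"Protein, Psite, Sequence, t0, t1\nDIG1, S142, SAPAQVTQHSK, 0.1, 0.2\n"

# Default parsing keeps the leading spaces in the column names.
as_read = pd.read_csv(io.StringIO(raw))
print(list(as_read.columns))

# Option 1: strip the names after reading.
cleaned = as_read.rename(columns=lambda name: name.strip())

# Option 2: ask the parser to skip spaces that follow a delimiter.
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(list(stripped.columns))
```

Option 2 also removes the leading space from the string values (such as the Psite entries), whereas option 1 only touches the column names.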



In [5]:

    
m = MassSpecAlignmentYeast(filenames, prefixes=["a0", "a1", "a5", "a10", "a20", "a45"], verbose=False)

We have merged the 6 yeast data sets into one. The data is now available as a dataframe in m.df.
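To give an intuition of what such an alignment involves, here is a sketch based on an outer merge in plain pandas. It is an assumption that matching rows on Protein, Psite and Sequence is a reasonable model of the alignment; MassSpecAlignmentYeast may proceed differently. The tables and values are made up.

```python
from functools import reduce

import pandas as pd

keys = ["Protein", "Psite", "Sequence"]

# Hypothetical, already-prefixed tables (values are made up for illustration).
a0 = pd.DataFrame({"Protein": ["DIG1", "DIG1"],
                   "Psite": ["S142", "S272"],
                   "Sequence": ["SAPAQVTQHSK", "VNDSYDSPLSGTASTGK"],
                   "a0_t0": [0.1, 0.2]})
a1 = pd.DataFrame({"Protein": ["DIG1"],
                   "Psite": ["S142"],
                   "Sequence": ["SAPAQVTQHSK"],
                   "a1_t0": [0.3]})

# An outer merge keeps every peptide seen in at least one table;
# measurements missing from a table become NaN.
frames = [a0, a1]
merged = reduce(lambda left, right: pd.merge(left, right, on=keys, how="outer"),
                frames)
print(merged.shape)
```

With an outer merge, a peptide that is absent from one file is kept and simply has missing values in that file's columns.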



In [6]:

    
m.df.loc[0:3]  # label-based slice (end label included); .ix was removed in pandas 1.0









    Out[6]:






  
    
      
   Protein  Sequence               Psite      Sequence_Phospho                     a0_t0       a0_t1       a0_t5       a0_t10      a0_t20      a0_t45      ...  a20_t5      a20_t10     a20_t20     a20_t45     a45_t0      a45_t1      a45_t5      a45_t10     a45_t20     a45_t45
0  DIG1     DGNLASSNSAHFPPVANQNVK  S126+S127  DGNLAS(Phospho)SNSAHFPPVANQNVK       0.00041509  0.00039651  0.0006711   0.00060249  0.00043997  0.00041787  ...  0.001149    0.000917    0.000902    0.001009    0.00028876  0.0003      0.00027013  0.00035849  0.00036712  0.00031307
1  DIG1     SAPAQVTQHSK            S142       S(Phospho)APAQVTQHSK                 0.00018739  0.00018479  0.0002666   0.00020245  0.00013835  0.00022575  ...  0.001135    0.000899    0.001064    0.001241    0.0011441   0.0013638   0.0012374   0.001091    0.0014252   0.001707
2  DIG1     VNDSYDSPLSGTASTGK      S272       VNDSYDS(Phospho)PLSGTASTGK           0.00033752  0.0003301   0.00053798  0.00050547  0.00038083  0.00032833  ...  0.000349    0.000319    0.000314    0.000232    0.0001779   0.000208    0.000122    0.00021177  0.00020337  0.0002206
3  DIG1     VNDSYDSPLSGTASTGK      S272^S275  VNDSYDS(Phospho)PLS(Phospho)GTASTGK  4.23e-05    4.7e-05     7.84e-05    4.92e-05    4.16e-05    3.58e-05    ...  0.000104    0.000075    0.000063    0.000061    6.4e-05     6.61e-05    7.44e-05    6.63e-05    6.18e-05    5.05e-05

4 rows × 40 columns



In [7]:

    
m.df.columns









    Out[7]:





Index([u'Protein', u'Sequence', u'Psite', u'Sequence_Phospho', u'a0_t0',
       u'a0_t1', u'a0_t5', u'a0_t10', u'a0_t20', u'a0_t45', u'a1_t0', u'a1_t1',
       u'a1_t5', u'a1_t10', u'a1_t20', u'a1_t45', u'a5_t0', u'a5_t1', u'a5_t5',
       u'a5_t10', u'a5_t20', u'a5_t45', u'a10_t0', u'a10_t1', u'a10_t5',
       u'a10_t10', u'a10_t20', u'a10_t45', u'a20_t0', u'a20_t1', u'a20_t5',
       u'a20_t10', u'a20_t20', u'a20_t45', u'a45_t0', u'a45_t1', u'a45_t5',
       u'a45_t10', u'a45_t20', u'a45_t45'],
      dtype='object')



In [8]:

    
m.df.shape









    Out[8]:





(57, 40)



In [9]:

    
r = readers.MassSpecReader(m)
from easydev import TempFile
f = TempFile() # a temporary named file
r.to_csv(f.name)
f.delete()









    



INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
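The cell above relies on TempFile from easydev to obtain a temporary file name. If that package is not available, the same save-then-clean-up pattern can be sketched with the standard library and a plain pandas dataframe. This only illustrates the pattern; it does not reproduce what MassSpecReader.to_csv writes, and the dataframe content is made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small dataframe standing in for the merged data.
df = pd.DataFrame({"Protein": ["DIG1"], "Psite": ["S142"], "a0_t0": [0.1]})

# Create a named temporary file and close the handle right away,
# so that the path can be reopened for writing on any platform.
handle = tempfile.NamedTemporaryFile(suffix=".csv", delete=False)
handle.close()
try:
    df.to_csv(handle.name, index=False)   # save without the index column
    reloaded = pd.read_csv(handle.name)   # read it back to check the content
finally:
    os.remove(handle.name)                # always delete the temporary file

print(list(reloaded.columns))
```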



In [10]:

    
r.plot_phospho_stats()


